Agenda

  • Recap
  • Machine Learning
  • Supervised vs. Unsupervised Learning
  • Basic Stats
  • Simple Linear Regression
  • Multiple Regression

Recap

Topics Covered

  • R Basics
  • Data types
  • Using functions
  • Writing functions
  • Working with Data
  • Reading data
  • Scraping data
  • Manipulating data
  • Visualization
  • Univariate plotting
  • Multivariate plotting
  • Facets
  • Rmarkdown reports

Machine Learning

ML Foundation

What is machine learning?

ML systems learn how to combine inputs to produce useful predictions on never-before-seen data.

ML Framework

  • Label: the variable that we are trying to predict
  • Features: the input variables that are used to produce predictions
  • Model: the definition of the relationship between the features and the label

Regression vs. Classification

In regression we predict continuous values, examples:

  • citibike users per day
  • temperatures
  • GDP

In contrast, classification predicts discrete values, examples:

  • land-use types
  • building permit approvals

Supervised vs. Unsupervised Learning

What we have been covering so far is supervised learning! We give the computer/algorithm already labeled data so it knows what outcomes we want to predict.

  • Regression
  • Decision Trees

When you have no labels, unsupervised learning comes into play. You still have features to work with but allow the algorithm to come up with patterns or structures.

  • Hierarchical Clustering
  • K-Means Clustering

Simple Splitting

In our current data-rich times, we have the luxury of partitioning our data, usually into two parts. The first is the training set, a subset (normally around 75-80%) of your total data used to train your model; the second is the test set, the rest of your data, against which you test your model to compare results.
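A minimal sketch of such a split in base R, assuming a 75/25 ratio on the built-in `iris` data:

```r
set.seed(42)                          # make the random split reproducible
n <- nrow(iris)                       # 150 observations in total
trainIdx <- sample(n, size = floor(0.75 * n))
trainSet <- iris[trainIdx, ]          # ~75% used to fit the model
testSet  <- iris[-trainIdx, ]         # remaining 25% held out for testing
nrow(trainSet)
nrow(testSet)
```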

Simple Splitting Example

Validation Set

We can add another partition, called the validation set, which allows us to tweak the model before we test it against our hold-out test set. That way we save our test data to confirm the final model.
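One way to sketch a three-way split in base R (the 60/20/20 proportions here are just an illustrative choice):

```r
set.seed(42)
idx <- sample(nrow(iris))             # shuffle the 150 row indices
trainSet <- iris[idx[1:90], ]         # 60% to fit candidate models
validSet <- iris[idx[91:120], ]       # 20% to tune/tweak the model
testSet  <- iris[idx[121:150], ]      # 20% hold-out to confirm the final model
```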

Occam's Razor

…simpler solutions are more likely to be correct than complex ones. When presented with competing hypotheses to solve a problem, one should select the solution with the fewest assumptions

The less complex an ML model, the more likely that a good empirical result is not just due to the peculiarities of the sample.

Bias

Error due to bias is the amount by which the expected model prediction differs from the true value of the training data. It is introduced by approximating a complicated real-world relationship with a much simpler model. High-bias algorithms are easier to learn but less flexible, and because of this they have lower predictive performance on complex problems. Linear algorithms and oversimplified models lead to high bias.

Variance

Error due to variance is the amount by which the prediction, for a single training set, differs from the expected prediction over all training sets. In machine learning, different training data sets will result in different estimates, but ideally the estimates should not vary too much between training sets. If a method has high variance, then small changes in the training data can result in large changes in results.

Bias Variance Tradeoff

Model Complexity

Basic Stats

Correlation

Correlation is a measure of the strength and direction of the linear relationship between two variables. Values fall between -1 and 1, with 0 meaning no linear relationship.

\[r_{xy} =\frac{\sum ^n _{i=1}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)s_xs_y}\]

  • \(\bar{x}\) is the mean of \(x\)
  • \(\bar{y}\) is the mean of \(y\)
  • \(s_x\) and \(s_y\) are the standard deviations of \(x\) and \(y\), respectively.
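In R, `cor()` computes \(r_{xy}\) directly; as a sanity check, the formula above gives the same value on the built-in `iris` data:

```r
x <- iris$Petal.Length
y <- iris$Sepal.Length
n <- length(x)
# Correlation from the formula
rFormula <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))
# Built-in correlation
rBuiltin <- cor(x, y)
c(rFormula, rBuiltin)                 # both ~0.87
```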

Correlation Plots

Covariance

The joint variability of two random variables. Unlike correlation which is dimensionless, covariance is in units obtained by multiplying the units of the two variables. \[\text{Cov}(X,Y) =\frac{\sum ^n _{i=1}(x_i - \bar{x})(y_i - \bar{y})}{n-1}\]
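The two quantities are linked: dividing the covariance by the product of the standard deviations recovers the dimensionless correlation. A quick check in R:

```r
x <- iris$Petal.Length                # measured in cm
y <- iris$Sepal.Length                # measured in cm
cov(x, y)                             # units of cm^2
cov(x, y) / (sd(x) * sd(y))           # identical to cor(x, y)
```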

Regression

Regression Models

Simple Linear Regression

This is one of the most foundational models in machine learning, if not the most. While its simplicity is stark, it is extremely powerful. Many of you have already encountered this tool growing up or in other elementary statistics courses. You've seen this equation:

\[ y =mx+b\]

where:

  • \(y\) is the response variable
  • \(m\) is the slope of the line
  • \(x\) is the predictor variable
  • \(b\) is the y-intercept

Linear Regression ML

In machine learning, we write \(y = mx+b\) slightly differently.

\[ y = \beta_0 + \beta_1x_1\] where:

  • \(y\) is the response
  • \(\beta_0\) is the constant or y-intercept
  • \(\beta_1\) is the first coefficient
  • \(x_1\) the value of the predictor variable

Calculating Errors

The error in this case is the deviation, or vertical distance, from the regression line to the actual data point.

\[e_i = y_i - \hat{y_i}\]
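A quick sketch computing the \(e_i\) by hand on a fitted `lm` model and checking them against R's `residuals()`:

```r
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)
e <- iris$Sepal.Length - fitted(fit)  # e_i = y_i - y_hat_i
all.equal(unname(e), unname(residuals(fit)))   # TRUE
```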

Mean Square Error

In this calculation we use \(n-\text{df}\) since we have to account for degrees of freedom.

\[\text{MSE} = \frac{\sum_{i=1}^n (y_i-\hat{y}_i)^2}{n-\text{df}}\]

Root Mean Square Error

\[\text{s} = \sqrt{\text{MSE}} \]

Errors vs. Residuals

Errors and residuals are almost identical; the main difference lies in inference. If we have a fully known population, the deviation of the actual value from the predicted value is called an error. If instead we take a sample and compute the deviations, they are called residuals.

Ordinary Least Squares

OLS is the most common method for fitting a regression line, allowing us to calculate the best-fitting line for the observed data. The criterion is to minimize the sum of the squared errors; since the deviations are squared first, positive and negative errors do not cancel each other out.

Regression Coefficient

To calculate the regression coefficient \(\beta_1\), the equation is:

\[\beta_1 = \frac{\sum_{i=1}^n (y_i-\bar{y})(x_i-\bar{x})}{\sum_{i=1}^n (x_i-\bar{x})^2} = \frac{\text{Cov}(x,y)}{\text{Var}(x)}\]
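You can verify this identity in R: the slope from `lm()` matches \(\text{Cov}(x,y)/\text{Var}(x)\), and the intercept follows from the fact that the fitted line passes through \((\bar{x}, \bar{y})\).

```r
x <- iris$Petal.Length
y <- iris$Sepal.Length
beta1 <- cov(x, y) / var(x)           # slope from the formula
beta0 <- mean(y) - beta1 * mean(x)    # intercept from the sample means
c(beta0, beta1)
coef(lm(y ~ x))                       # same values
```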

R Simple Linear Regression

irisMod <- lm(Sepal.Length~Petal.Length, iris)
summary(irisMod)
Call:
lm(formula = Sepal.Length ~ Petal.Length, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.24675 -0.29657 -0.01515  0.27676  1.00269 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.30660    0.07839   54.94   <2e-16 ***
Petal.Length  0.40892    0.01889   21.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4071 on 148 degrees of freedom
Multiple R-squared:   0.76, Adjusted R-squared:  0.7583 
F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

RMSE

## MSE
sum(residuals(irisMod)^2) / df.residual(irisMod)
[1] 0.1657097
## RMSE
sqrt(sum(residuals(irisMod)^2) / df.residual(irisMod))
[1] 0.4070745

Plotting lm

lm Prediction

predict.lm(irisMod, data.frame(Petal.Length = 1:10))
       1        2        3        4        5        6        7        8 
4.715526 5.124448 5.533370 5.942293 6.351215 6.760137 7.169059 7.577982 
       9       10 
7.986904 8.395826 

Extrapolation

Multiple Regression

There are often times you need to model a phenomenon with more than one predictor variable. That's when you need to add more inputs. The general equation is:

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}\]

Multicollinearity

Multicollinearity occurs when two or more independent variables in a multiple regression model are highly linearly related. This often reduces the power of the model to identify independent variables that are statistically significant.

  • Structural multicollinearity is a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor \(x^2\) from the predictor \(x\).

  • Data-based multicollinearity on the other hand, is a result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which the data are collected.
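A quick first check for data-based multicollinearity is the correlation matrix of the predictors. In `iris`, for instance, the two petal measurements are very highly correlated:

```r
# Pairwise correlation of two candidate predictors
cor(iris$Petal.Length, iris$Petal.Width)      # ~0.96: strongly related
# Full correlation matrix of the numeric predictors
cor(iris[, c("Petal.Length", "Petal.Width", "Sepal.Width")])
```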

Issues with Multicollinearity

  • When predictor variables are correlated, the estimated regression coefficient of any one variable depends on which other predictor variables are included in the model.

  • When predictor variables are correlated, the precision of the estimated regression coefficients decreases as more predictor variables are added to the model. Larger standard errors mean larger confidence intervals.

  • When predictor variables are correlated, the marginal contribution of any one predictor variable in reducing the error sum of squares varies depending on which other variables are already in the model. If one of the correlated variables is explaining the response variable, there is less room for the other correlated variable to explain it.

Heteroscedasticity

The residuals should be spread (relatively) equally along the ranges of predictors.
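A standard diagnostic is a residuals-vs-fitted plot; here is a base-R sketch using the simple iris model from earlier:

```r
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)   # points should scatter evenly around this line
```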

Multiple Linear Regression

You can add new predictors by using `+` to append input variables to the model formula.

irisMod2 <- lm(Sepal.Length~Petal.Length + Petal.Width, iris)
summary(irisMod2)
Call:
lm(formula = Sepal.Length ~ Petal.Length + Petal.Width, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.18534 -0.29838 -0.02763  0.28925  1.02320 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.19058    0.09705  43.181  < 2e-16 ***
Petal.Length  0.54178    0.06928   7.820 9.41e-13 ***
Petal.Width  -0.31955    0.16045  -1.992   0.0483 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4031 on 147 degrees of freedom
Multiple R-squared:  0.7663,    Adjusted R-squared:  0.7631 
F-statistic:   241 on 2 and 147 DF,  p-value: < 2.2e-16

Three Dimensional Graphics

library(plotly)
plot_ly(data = iris, x=~Petal.Length, y=~Sepal.Length, 
        z=~Petal.Width, type="scatter3d", mode="markers", color =~Species)